DRAFT add Data Health Checker #1574

benheckmann · 2023-06-28T12:50:48Z

Description

First draft of a data health checker, as discussed in #854. The checker takes a path to data in CSV format or in qlib format (the latter is not implemented yet), loads the data into a DataFrame, and performs basic checks for completeness and correctness.

I am not yet very familiar with qlib's data handling, so I would appreciate early feedback on whether this is heading in the right direction.
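
To make the intended checks concrete, here is a minimal sketch of a required-column check and a missing-data count on a DataFrame. It is not the code in this PR: the helper names `check_required_columns` and `count_missing_values` and the sample data are assumptions for illustration only. The required column list matches the one printed in the log output under Motivation and Context.

```python
import pandas as pd

# Required columns as reported by the checker's log output.
REQUIRED_COLUMNS = ["open", "high", "low", "close", "volume"]


def check_required_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


def count_missing_values(df):
    """Return {column: number of NaN cells} for columns with at least one NaN."""
    counts = df.isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Illustrative data: "volume" is misnamed "vol", and "low" has one gap.
df = pd.DataFrame(
    {
        "open": [1.0, 1.1],
        "high": [1.2, 1.3],
        "low": [0.9, None],
        "close": [1.0, 1.2],
        "vol": [100, 120],
    }
)

missing_cols = check_required_columns(df)
nan_counts = count_missing_values(df)
print(missing_cols, nan_counts)
```

Returning plain lists and dicts keeps each check easy to aggregate into a per-problem summary across many files.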

Motivation and Context

See #854. In that issue, a user got an unhelpful error message because their data did not adhere to the expected format (specifically, the "volume" column was named "vol"). Running this checker on the data from #854 would give the user:

[...]
ERROR:root:002645.SZ.csv: Missing columns ['volume'] of required columns ['open', 'high', 'low', 'close', 'volume'].
WARNING:root:002645.SZ.csv: Missing 'factor' column, trading unit will be disabled.

Summary of data health check (4220 files checked):
-----------------------
Problem                   Count  Affected columns
MISSING_REQUIRED_COLUMN   4220   {'volume'}
MISSING_DATA              0      -
LARGE_STEP_CHANGE         14     {'low', 'open', 'close', 'high'}
MISSING_FACTOR            4220   {'factor'}

Note: the large step change check uses two configurable thresholds (one for price, one for volume) and only examines the OHLCV columns.
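
A possible shape for such a check is sketched below. This is an illustration under stated assumptions, not the implementation in this PR: the function name `find_large_step_changes`, the use of relative change between consecutive rows as the step measure, and the default threshold values are all hypothetical.

```python
import pandas as pd

PRICE_COLUMNS = ("open", "high", "low", "close")


def find_large_step_changes(df, price_threshold=0.5, volume_threshold=3.0):
    """Return the set of OHLCV columns containing a large step change.

    A step is measured as the absolute relative change between consecutive
    rows (an assumption); thresholds are hypothetical defaults.
    """
    affected = set()
    thresholds = {col: price_threshold for col in PRICE_COLUMNS}
    thresholds["volume"] = volume_threshold
    for col, threshold in thresholds.items():
        if col not in df.columns:
            continue  # missing columns are reported by a separate check
        # Relative change vs. the previous row; the first row yields NaN,
        # and NaN > threshold evaluates to False.
        change = (df[col] / df[col].shift(1) - 1).abs()
        if (change > threshold).any():
            affected.add(col)
    return affected


# Illustrative data: "close" doubles between the second and third row.
df = pd.DataFrame(
    {
        "open": [10.0, 10.2, 10.4, 10.3],
        "close": [10.0, 10.5, 21.0, 20.0],
        "volume": [100, 150, 200, 250],
    }
)

affected = find_large_step_changes(df)
print(affected)
```

Keeping separate thresholds reflects that volume typically fluctuates far more than price from one bar to the next, so one shared threshold would either miss price jumps or flag ordinary volume swings.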

How Has This Been Tested?

No tests yet, as this is only a first draft.

Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

Pipeline test:
Your own tests:

Types of changes

Fix bugs
Add new feature
Update documentation

benheckmann · 2023-06-29T07:55:02Z

@microsoft-github-policy-service agree

Fivele-Li · 2023-08-08T04:40:18Z

Add unit tests in qlib.test

github-actions bot added the waiting for triage Cannot auto-triage, wait for triage. label Jun 28, 2023

benheckmann added 2 commits July 17, 2023 13:08

microsoft#854 implement first data health checker draft

46d33ab

microsoft#854 added support for qlib's data format, implemented factor check, reformatted summary

1c7c5d0

benheckmann force-pushed the data-health-checker branch from 73bf59e to 1c7c5d0 Compare July 17, 2023 11:10

benheckmann mentioned this pull request Jul 17, 2023

Health checker for Qlib Data #854

Open

SunsetWolf force-pushed the main branch from 702de78 to 194284b Compare May 7, 2024 06:20
